Abstract

We propose an optimization method for minimizing finite sums of smooth convex functions. Our method incorporates accelerated gradient descent (AGD) and the stochastic variance reduction gradient (SVRG) in a mini-batch setting. Unlike SVRG, our method can be directly applied to both non-strongly and strongly convex problems. We show that our method achieves a lower overall complexity than recently proposed methods that support non-strongly convex problems. Moreover, this method has a fast rate of convergence for strongly convex problems. Our experiments show the effectiveness of our method.


1 Introduction 


We consider the minimization problem:

$$\underset{x \in \mathbb{R}^d}{\text{minimize}} \quad f(x) \overset{\mathrm{def}}{=} \frac{1}{n} \sum_{i=1}^{n} f_i(x), \tag{1}$$

where $f_1, \ldots, f_n$ are smooth convex functions from $\mathbb{R}^d$ to $\mathbb{R}$. In machine learning, we often encounter optimization problems of this type, i.e., empirical risk minimization. For example, suppose we are given a sequence of training examples $(a_1, b_1), \ldots, (a_n, b_n)$, where $a_i \in \mathbb{R}^d$ and $b_i \in \mathbb{R}$. If we set $f_i(x) = \frac{1}{2}(a_i^\top x - b_i)^2$, then we obtain linear regression. If we set $f_i(x) = \log(1 + \exp(-b_i x^\top a_i))$ ($b_i \in \{-1, 1\}$), then we obtain logistic regression. Each $f_i(x)$ may include smooth regularization terms. In this paper we make the following assumption.

Assumption 1. Each convex function $f_i(x)$ is $L$-smooth, i.e., there exists $L > 0$ such that for all $x, y \in \mathbb{R}^d$,

$$\|\nabla f_i(x) - \nabla f_i(y)\| \le L \|x - y\|.$$

In part of this paper (the latter half of Section 4), we also assume that $f(x)$ is $\mu$-strongly convex.

Assumption 2. $f(x)$ is $\mu$-strongly convex, i.e., there exists $\mu > 0$ such that for all $x, y \in \mathbb{R}^d$,

$$f(x) \ge f(y) + \langle \nabla f(y), x - y \rangle + \frac{\mu}{2} \|x - y\|^2.$$

Note that $L \ge \mu$ necessarily holds.
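As a concrete instance of problem (1), the following minimal sketch writes out the logistic-regression components $f_i$, their gradients, and a smoothness constant satisfying Assumption 1. The function names and the bound $L = \max_i \|a_i\|^2 / 4$ are illustrative choices made here and are not taken from the paper.

```python
import numpy as np

def logistic_loss(x, a_i, b_i):
    """Component f_i(x) = log(1 + exp(-b_i * <a_i, x>))."""
    return np.logaddexp(0.0, -b_i * (a_i @ x))

def logistic_grad(x, a_i, b_i):
    """Gradient of f_i: -b_i * a_i / (1 + exp(b_i * <a_i, x>))."""
    margin = b_i * (a_i @ x)
    return -b_i * a_i / (1.0 + np.exp(margin))

def finite_sum(x, A, b):
    """f(x) = (1/n) * sum_i f_i(x), the objective of problem (1)."""
    return np.mean([logistic_loss(x, A[i], b[i]) for i in range(len(b))])

def smoothness_constant(A):
    """A valid L for Assumption 1 with logistic loss: max_i ||a_i||^2 / 4."""
    return np.max(np.sum(A * A, axis=1)) / 4.0
```

The bound follows because the Hessian of each $f_i$ is $\sigma(1-\sigma)\, a_i a_i^\top$ with $\sigma(1-\sigma) \le 1/4$.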


Several papers recently proposed effective methods (SAG [1, 2], SDCA [3, 4], SVRG [5], S2GD [6], Acc-Prox-SDCA [7], Prox-SVRG [8], MISO [9], SAGA [10], Acc-Prox-SVRG [11], mS2GD [12]) for solving problem (1). These methods attempt to reduce the variance of the stochastic gradient and achieve linear convergence rates, like deterministic gradient descent, when $f(x)$ is strongly convex. Moreover, because each iteration is computationally cheap, the overall complexities (total number of component gradient evaluations needed to find an $\epsilon$-accurate solution in expectation) of these methods are less than those of the deterministic and stochastic gradient descent methods.

An advantage of SAG and SAGA is that they support non-strongly convex problems. Although we can apply any of these methods to non-strongly convex functions by adding a slight $L_2$-regularization, this modification increases the difficulty of model selection. In the non-strongly convex case, the overall complexities of SAG and SAGA are $O((n + L)/\epsilon)$. This complexity is less than that of deterministic gradient descent, which has a complexity of $O(nL/\epsilon)$, and is a trade-off with $O(n\sqrt{L/\epsilon})$, which is the complexity of AGD.

In this paper we propose a new method that incorporates AGD and SVRG in a mini-batch setting, like Acc-Prox-SVRG [11]. The difference between our method and Acc-Prox-SVRG is that our method incorporates the accelerated method of [13], which is similar to Nesterov's acceleration [14], whereas Acc-Prox-SVRG incorporates [15]. Unlike SVRG and Acc-Prox-SVRG, our method is directly applicable to non-strongly convex problems and achieves an overall complexity of

$$\tilde{O}\left(n + \min\left\{\frac{L}{\epsilon},\; n\sqrt{\frac{L}{\epsilon}}\right\}\right),$$

where the notation $\tilde{O}$ hides constant and logarithmic terms. This complexity is less than that of SAG, SAGA, and AGD. Moreover, in the strongly convex case, our method achieves a complexity of

$$\tilde{O}\left(n + \min\{\kappa,\; n\sqrt{\kappa}\}\right),$$

where $\kappa$ is the condition number $L/\mu$. This complexity is the same as that of Acc-Prox-SVRG. Thus, our method converges quickly for both non-strongly and strongly convex problems.

In Sections 2 and 3, we review the recently proposed accelerated gradient method [13] and the stochastic variance reduction gradient [5]. In Section 4, we describe the general scheme of our method and prove an important lemma that gives us a novel insight for constructing specific algorithms. Moreover, we derive an algorithm that is applicable to non-strongly and strongly convex problems and show its fast convergence complexity. Our method is a multi-stage scheme like SVRG, but it can be difficult to decide when we should restart a stage. Thus, in Section 5, we introduce some heuristics for determining the restarting time. In Section 6, we present experiments that show the effectiveness of our method.

2 Accelerated Gradient Descent 

We first introduce some notation. In this section, $\|\cdot\|$ denotes a general norm on $\mathbb{R}^d$. Let $d(x): \mathbb{R}^d \to \mathbb{R}$ be a distance generating function (i.e., a 1-strongly convex smooth function with respect to $\|\cdot\|$). Accordingly, we define the Bregman divergence by

$$V_x(y) = d(y) - \left( d(x) + \langle \nabla d(x), y - x \rangle \right), \quad \forall x, \forall y \in \mathbb{R}^d,$$

where $\langle \cdot, \cdot \rangle$ is the Euclidean inner product. The accelerated method proposed in [13] uses a gradient step and a mirror descent step and takes a linear combination of the resulting points. That is,

$$\begin{aligned}
\text{(Convex Combination)} \quad & x_{k+1} \leftarrow \tau_k z_k + (1 - \tau_k) y_k, \\
\text{(Gradient Descent)} \quad & y_{k+1} \leftarrow \arg\min_{y \in \mathbb{R}^d} \left\{ \langle \nabla f(x_{k+1}), y - x_{k+1} \rangle + \frac{L}{2} \|y - x_{k+1}\|^2 \right\}, \\
\text{(Mirror Descent)} \quad & z_{k+1} \leftarrow \arg\min_{z \in \mathbb{R}^d} \left\{ \alpha_{k+1} \langle \nabla f(x_{k+1}), z - z_k \rangle + V_{z_k}(z) \right\}.
\end{aligned}$$

Then, with appropriate parameters, $f(y_k)$ converges to the optimal value as fast as Nesterov's accelerated methods [14, 15] for non-strongly convex problems. Moreover, in the strongly convex case, we obtain the same fast convergence as Nesterov's methods by restarting this entire procedure.

In the rest of the paper, we only consider the Euclidean norm, i.e., $\|\cdot\| = \|\cdot\|_2$.
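The three coupled steps above have closed forms in the Euclidean case, where $V_z(u) = \frac{1}{2}\|u - z\|^2$. The sketch below is a minimal illustration under stated assumptions: the text only says "appropriate parameters", so the schedule $\alpha_{k+1} = (k+2)/(2L)$, $\tau_k = 1/(\alpha_{k+1} L)$ is one common choice from the linear-coupling literature, and the function name is hypothetical.

```python
import numpy as np

def linear_coupling_agd(grad, L, x0, num_iters):
    """Euclidean linear coupling: a gradient step for y and a mirror step for z."""
    y = x0.copy()
    z = x0.copy()
    for k in range(num_iters):
        alpha = (k + 2) / (2.0 * L)      # assumed mirror step size alpha_{k+1}
        tau = 1.0 / (alpha * L)          # assumed coupling weight tau_k = 2 / (k + 2)
        x = tau * z + (1.0 - tau) * y    # convex combination
        g = grad(x)
        y = x - g / L                    # gradient descent step (closed-form argmin)
        z = z - alpha * g                # mirror step for V_z(u) = 0.5 * ||u - z||^2
    return y
```

With this schedule, the known guarantee for an $L$-smooth convex $f$ is an $O(L\|x_0 - x_*\|^2 / T^2)$ objective gap after $T$ iterations.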

3 Stochastic Variance Reduction Gradient 

To ensure the convergence of stochastic gradient descent (SGD), the learning rate must decay to zero so that we can reduce the variance effect of the stochastic gradient. This slows down the convergence. Variance reduction techniques [5, 11, 12] such as SVRG have been proposed to solve this problem. We review SVRG in a mini-batch setting [11, 12]. SVRG is a multi-stage scheme. During each stage, this method performs $m$ SGD iterations using the following direction,

$$v_k = \nabla f_{I_k}(x_k) - \nabla f_{I_k}(\tilde{x}) + \nabla f(\tilde{x}),$$

where $\tilde{x}$ is the starting point of the stage, $k$ is an iteration index, $I_k = \{i_1, \ldots, i_b\}$ is a uniformly randomly chosen size-$b$ subset of $\{1, 2, \ldots, n\}$, and $f_{I_k} = \frac{1}{b} \sum_{j=1}^{b} f_{i_j}$. Note that $v_k$ is an unbiased estimator of the gradient $\nabla f(x_k)$: $\mathbb{E}_{I_k}[v_k] = \nabla f(x_k)$, where $\mathbb{E}_{I_k}$ denotes the expectation with respect to $I_k$. A bound on the variance of $v_k$ is given in the following lemma, which is proved in the Supplementary Material.

Lemma 1. Suppose Assumption 1 holds, and let $x_* = \arg\min_x f(x)$. Conditioned on $x_k$, we have

$$\mathbb{E}_{I_k} \|v_k - \nabla f(x_k)\|^2 \le 4L \frac{n - b}{b(n - 1)} \left( f(x_k) - f(x_*) + f(\tilde{x}) - f(x_*) \right). \tag{2}$$

Due to this lemma, SVRG with $b = 1$ achieves a complexity of $O((n + \kappa) \log \frac{1}{\epsilon})$.
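The unbiasedness of the mini-batch SVRG direction can be checked exactly on a small problem by averaging over every size-$b$ subset. The sketch below does this; `grad_i` (a callable returning the gradient of the $j$-th component) and both function names are hypothetical helpers introduced only for illustration.

```python
import itertools
import numpy as np

def svrg_direction(grad_i, x, x_tilde, full_grad_tilde, batch):
    """Mini-batch SVRG direction for index set `batch`:
    v = grad f_I(x) - grad f_I(x_tilde) + grad f(x_tilde)."""
    g_x = np.mean([grad_i(j, x) for j in batch], axis=0)
    g_tilde = np.mean([grad_i(j, x_tilde) for j in batch], axis=0)
    return g_x - g_tilde + full_grad_tilde

def mean_direction_over_batches(grad_i, n, b, x, x_tilde):
    """Average v over every size-b subset, i.e. the exact expectation E_I[v]."""
    full_tilde = np.mean([grad_i(j, x_tilde) for j in range(n)], axis=0)
    dirs = [svrg_direction(grad_i, x, x_tilde, full_tilde, S)
            for S in itertools.combinations(range(n), b)]
    return np.mean(dirs, axis=0)
```

Because every index appears in the same number of subsets, the two mini-batch terms average to $\nabla f(x)$ and $\nabla f(\tilde{x})$, and the latter cancels against the full-gradient correction.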

4 Algorithms 

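We now introduce our Accelerated efficient Mini-batch SVRG (AMSVRG), which incorporates AGD and SVRG in a mini-batch setting. Our method is a multi-stage scheme similar to SVRG. During each stage, this method performs several APG-like [13] iterations and uses the SVRG direction in a mini-batch setting. Each stage of AMSVRG is described in Figure 1.


Algorithm 1$(y_0, z_0, m, \eta, (\alpha_{k+1})_{k \in \mathbb{Z}_+}, (b_{k+1})_{k \in \mathbb{Z}_+}, (\tau_k)_{k \in \mathbb{Z}_+})$

    $\tilde{v} \leftarrow \frac{1}{n} \sum_{i=1}^{n} \nabla f_i(y_0)$
    for $k \leftarrow 0$ to $m$
        $x_{k+1} \leftarrow (1 - \tau_k) y_k + \tau_k z_k$
        Randomly pick subset $I_{k+1} \subset \{1, 2, \ldots, n\}$ of size $b_{k+1}$
        $v_{k+1} \leftarrow \nabla f_{I_{k+1}}(x_{k+1}) - \nabla f_{I_{k+1}}(y_0) + \tilde{v}$
        $y_{k+1} \leftarrow \arg\min_{y \in \mathbb{R}^d} \left\{ \eta \langle v_{k+1}, y - x_{k+1} \rangle + \frac{1}{2} \|y - x_{k+1}\|^2 \right\}$   (SGD step)
        $z_{k+1} \leftarrow \arg\min_{z \in \mathbb{R}^d} \left\{ \alpha_{k+1} \langle v_{k+1}, z - z_k \rangle + V_{z_k}(z) \right\}$   (SMD step)
    end
    Option I: Return $y_{m+1}$
    Option II: Return $\frac{1}{m+1} \sum_{k=1}^{m+1} x_k$


Figure 1: Each stage of AMSVRG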
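A single stage of Algorithm 1 with Option I can be sketched in a few lines for the Euclidean case, where both argmin steps have closed forms. This is a minimal illustration under assumptions: it fixes $\eta = 1/L$, $\alpha_{k+1} = (k+2)/(4L)$ and $1/\tau_k = L\alpha_{k+1} + 1/2$ (the parameter choice used in the analysis of this section), and the names `amsvrg_stage`, `grad_i`, and `batch_sizes` are hypothetical.

```python
import numpy as np

def amsvrg_stage(grad_i, n, y0, z0, m, L, batch_sizes, rng):
    """One stage of Algorithm 1 (Option I), Euclidean case with eta = 1/L.

    grad_i(i, x) returns the gradient of f_i at x; batch_sizes[k] is b_{k+1}.
    """
    eta = 1.0 / L
    v_tilde = np.mean([grad_i(i, y0) for i in range(n)], axis=0)  # full gradient at y0
    y, z = y0.copy(), z0.copy()
    for k in range(m + 1):                       # k = 0, ..., m
        alpha = (k + 2) / (4.0 * L)              # alpha_{k+1}
        tau = 1.0 / (L * alpha + 0.5)            # 1 / tau_k = L * alpha_{k+1} + 1/2
        x = (1.0 - tau) * y + tau * z
        batch = rng.choice(n, size=batch_sizes[k], replace=False)
        g_x = np.mean([grad_i(i, x) for i in batch], axis=0)
        g_0 = np.mean([grad_i(i, y0) for i in batch], axis=0)
        v = g_x - g_0 + v_tilde                  # variance-reduced direction v_{k+1}
        y = x - eta * v                          # SGD step (closed form of the argmin)
        z = z - alpha * v                        # SMD step for V_z(u) = 0.5 * ||u - z||^2
    return y
```

Passing `batch_sizes = [n] * (m + 1)` makes every direction the exact gradient, which is a convenient deterministic sanity check before trying increasing mini-batch sizes.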


4.1 Convergence analysis of the single stage of AMSVRG 


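Before we introduce the multi-stage scheme, we show the convergence of Algorithm 1. The following lemma is the key to the analysis of our method and gives us insight into how to construct algorithms.

Lemma 2. Consider Algorithm 1 in Figure 1 under Assumption 1. We set $\delta_k = \frac{n - b_k}{b_k (n - 1)}$. Let $x_* \in \arg\min_{x \in \mathbb{R}^d} f(x)$. If $\eta = \frac{1}{L}$, then we have

$$\begin{aligned}
& \sum_{k=0}^{m} \alpha_{k+1} \left( \frac{1}{\tau_k} - (1 + 4\delta_{k+1}) L \alpha_{k+1} \right) \mathbb{E}[f(x_{k+1}) - f(x_*)] + L \alpha_{m+1}^2 \mathbb{E}[f(y_{m+1}) - f(x_*)] \\
& \quad \le V_{z_0}(x_*) + \sum_{k=1}^{m} \left( \alpha_{k+1} \frac{1 - \tau_k}{\tau_k} - L \alpha_k^2 \right) \mathbb{E}[f(y_k) - f(x_*)] \\
& \qquad + \left( \alpha_1 \frac{1 - \tau_0}{\tau_0} + 4L \sum_{k=0}^{m} \alpha_{k+1}^2 \delta_{k+1} \right) (f(y_0) - f(x_*)).
\end{aligned}$$

To prove Lemma 2, additional lemmas are required, which are proved in the Supplementary Material.

Lemma 3. (Stochastic Gradient Descent). Suppose Assumption 1 holds, and let $\eta = \frac{1}{L}$. Conditioned on $x_k$, it follows that for $k \ge 1$,

$$\mathbb{E}_{I_k}[f(y_k)] \le f(x_k) - \frac{1}{2L} \|\nabla f(x_k)\|^2 + \frac{1}{2L} \mathbb{E}_{I_k} \|v_k - \nabla f(x_k)\|^2. \tag{3}$$

Lemma 4. (Stochastic Mirror Descent). Conditioned on $x_k$, we have that for arbitrary $u \in \mathbb{R}^d$,

$$\alpha_k \langle \nabla f(x_k), z_{k-1} - u \rangle \le V_{z_{k-1}}(u) - \mathbb{E}_{I_k}[V_{z_k}(u)] + \frac{1}{2} \alpha_k^2 \|\nabla f(x_k)\|^2 + \frac{1}{2} \alpha_k^2 \mathbb{E}_{I_k} \|v_k - \nabla f(x_k)\|^2. \tag{4}$$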


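Proof of Lemma 2. We denote $V_{z_k}(x_*)$ by $V_k$ for simplicity. From Lemmas 1, 3, and 4 with $u = x_*$,

$$\begin{aligned}
\alpha_{k+1} \langle \nabla f(x_{k+1}), z_k - x_* \rangle
&\overset{(3),(4)}{\le} V_k - \mathbb{E}_{I_{k+1}}[V_{k+1}] + L \alpha_{k+1}^2 \left( f(x_{k+1}) - \mathbb{E}_{I_{k+1}}[f(y_{k+1})] \right) + \alpha_{k+1}^2 \mathbb{E}_{I_{k+1}} \|v_{k+1} - \nabla f(x_{k+1})\|^2 \\
&\overset{(2)}{\le} V_k - \mathbb{E}_{I_{k+1}}[V_{k+1}] + L \alpha_{k+1}^2 \left( f(x_{k+1}) - \mathbb{E}_{I_{k+1}}[f(y_{k+1})] \right) \\
&\qquad + 4L \alpha_{k+1}^2 \delta_{k+1} \left( f(x_{k+1}) - f(x_*) + f(y_0) - f(x_*) \right) \\
&= V_k - \mathbb{E}_{I_{k+1}}[V_{k+1}] + (1 + 4\delta_{k+1}) L \alpha_{k+1}^2 \left( f(x_{k+1}) - f(x_*) \right) - L \alpha_{k+1}^2 \mathbb{E}_{I_{k+1}}[f(y_{k+1}) - f(x_*)] \\
&\qquad + 4L \alpha_{k+1}^2 \delta_{k+1} (f(y_0) - f(x_*)).
\end{aligned}$$

By taking the expectation with respect to the history of random variables $I_1, I_2, \ldots$, we have

$$\begin{aligned}
\alpha_{k+1} \mathbb{E}[\langle \nabla f(x_{k+1}), z_k - x_* \rangle] \le{} & \mathbb{E}[V_k - V_{k+1}] + (1 + 4\delta_{k+1}) L \alpha_{k+1}^2 \mathbb{E}[f(x_{k+1}) - f(x_*)] \\
& - L \alpha_{k+1}^2 \mathbb{E}[f(y_{k+1}) - f(x_*)] + 4L \alpha_{k+1}^2 \delta_{k+1} (f(y_0) - f(x_*)),
\end{aligned} \tag{5}$$

and we get

$$\begin{aligned}
\sum_{k=0}^{m} \alpha_{k+1} \mathbb{E}[f(x_{k+1}) - f(x_*)]
&\le \sum_{k=0}^{m} \alpha_{k+1} \mathbb{E}[\langle \nabla f(x_{k+1}), x_{k+1} - x_* \rangle] \\
&= \sum_{k=0}^{m} \alpha_{k+1} \left( \mathbb{E}[\langle \nabla f(x_{k+1}), x_{k+1} - z_k \rangle] + \mathbb{E}[\langle \nabla f(x_{k+1}), z_k - x_* \rangle] \right) \\
&= \sum_{k=0}^{m} \alpha_{k+1} \left( \frac{1 - \tau_k}{\tau_k} \mathbb{E}[\langle \nabla f(x_{k+1}), y_k - x_{k+1} \rangle] + \mathbb{E}[\langle \nabla f(x_{k+1}), z_k - x_* \rangle] \right) \\
&\le \sum_{k=0}^{m} \left( \alpha_{k+1} \frac{1 - \tau_k}{\tau_k} \mathbb{E}[f(y_k) - f(x_{k+1})] + \alpha_{k+1} \mathbb{E}[\langle \nabla f(x_{k+1}), z_k - x_* \rangle] \right).
\end{aligned} \tag{6}$$

Using (5), (6), and $V_{z_{m+1}}(x_*) \ge 0$, we have

$$\begin{aligned}
& \sum_{k=0}^{m} \alpha_{k+1} \left( 1 + \frac{1 - \tau_k}{\tau_k} - (1 + 4\delta_{k+1}) L \alpha_{k+1} \right) \mathbb{E}[f(x_{k+1}) - f(x_*)] \\
& \quad \le V_0 + \sum_{k=0}^{m} \alpha_{k+1} \frac{1 - \tau_k}{\tau_k} \mathbb{E}[f(y_k) - f(x_*)] - L \sum_{k=0}^{m} \alpha_{k+1}^2 \mathbb{E}[f(y_{k+1}) - f(x_*)] \\
& \qquad + 4L \sum_{k=0}^{m} \alpha_{k+1}^2 \delta_{k+1} (f(y_0) - f(x_*)).
\end{aligned}$$

This completes the proof of Lemma 2. □


From now on we consider Algorithm 1 with Option I and set

$$\eta = \frac{1}{L}, \quad \alpha_{k+1} = \frac{1}{4L}(k + 2), \quad \frac{1}{\tau_k} = L \alpha_{k+1} + \frac{1}{2}, \quad \text{for } k = 0, 1, \ldots. \tag{7}$$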
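Theorem 1. Consider Algorithm 1 with Option I under Assumption 1. For $p \in (0, \frac{1}{2}]$, we choose $b_{k+1} \in \mathbb{Z}_+$ such that $4L \delta_{k+1} \alpha_{k+1} \le p$. Then, we have

$$\mathbb{E}[f(y_{m+1}) - f(x_*)] \le \frac{16L}{(m + 2)^2} V_{z_0}(x_*) + \frac{5}{2} p \, (f(y_0) - f(x_*)).$$

Moreover, if $m \ge 4 \sqrt{\frac{L V_{z_0}(x_*)}{q (f(y_0) - f(x_*))}}$ for $q > 0$, then it follows that

$$\mathbb{E}[f(y_{m+1}) - f(x_*)] \le \left( q + \frac{5}{2} p \right) (f(y_0) - f(x_*)).$$


Proof. Using Lemma 2 and

$$\begin{aligned}
& \tau_0 = 1, \qquad \frac{1}{\tau_k} - (1 + 4\delta_{k+1}) L \alpha_{k+1} = \frac{1}{2} - 4L \delta_{k+1} \alpha_{k+1} \ge 0, \\
& \alpha_{k+1} \frac{1 - \tau_k}{\tau_k} - L \alpha_k^2 = L \alpha_{k+1}^2 - \frac{1}{2} \alpha_{k+1} - L \alpha_k^2 = -\frac{1}{16L} < 0,
\end{aligned}$$

we have

$$L \alpha_{m+1}^2 \mathbb{E}[f(y_{m+1}) - f(x_*)] \le V_{z_0}(x_*) + 4L \sum_{k=0}^{m} \alpha_{k+1}^2 \delta_{k+1} (f(y_0) - f(x_*)).$$

This proves the theorem because $L \alpha_{m+1}^2 = \frac{(m+2)^2}{16L}$ and $4L \sum_{k=0}^{m} \alpha_{k+1}^2 \delta_{k+1} \le p \sum_{k=0}^{m} \alpha_{k+1} \le \frac{5p}{32L} (m + 2)^2$.

□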


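Let $b_{k+1}, m \in \mathbb{Z}_+$ be the minimum values satisfying the assumptions of Theorem 1 for $p = q = \epsilon$, i.e.,

$$b_{k+1} = \left\lceil \frac{n(k + 2)}{\epsilon(n - 1) + k + 2} \right\rceil \quad \text{and} \quad m = \left\lceil 4 \sqrt{\frac{L V_{z_0}(x_*)}{\epsilon (f(y_0) - f(x_*))}} \right\rceil.$$

Then, from Theorem 1, we have an upper bound on the overall complexity (total number of component gradient evaluations to obtain an $\epsilon$-accurate solution in expectation):

$$O\left( n + \sum_{k=0}^{m} b_{k+1} \right) \le O\left( n + \frac{n m^2}{\epsilon n + m} \right) = O\left( n + \frac{nL}{\epsilon^2 n + \sqrt{\epsilon L}} \right),$$

where we used the monotonicity of $b_{k+1}$ with respect to $k$ for the first inequality. Note that the notation $O$ also hides $V_{z_0}(x_*)$ and $f(y_0) - f(x_*)$.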
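The batch-size rule is easy to evaluate numerically. With $\alpha_{k+1} = (k+2)/(4L)$, the condition $4L\delta_{k+1}\alpha_{k+1} \le p$ reduces to $\delta_{k+1}(k+2) \le p$, so $L$ cancels. The sketch below computes the smallest admissible $b_{k+1}$ and stage length $m$; the function names are hypothetical.

```python
import math

def delta(n, b):
    """Variance factor delta = (n - b) / (b * (n - 1)) from Lemma 1."""
    return (n - b) / (b * (n - 1))

def batch_size(n, k, p):
    """Smallest integer b_{k+1} with 4 * L * delta_{k+1} * alpha_{k+1} <= p,
    assuming alpha_{k+1} = (k + 2) / (4L), so that L cancels."""
    return math.ceil(n * (k + 2) / (p * (n - 1) + k + 2))

def stage_length(L, V0, gap0, q):
    """Smallest integer m with m >= 4 * sqrt(L * V0 / (q * gap0))."""
    return math.ceil(4.0 * math.sqrt(L * V0 / (q * gap0)))
```

The resulting schedule grows with $k$ and never exceeds $n$, which is why the sum of batch sizes can be bounded by its last term times the number of iterations.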


4.2 Multi-Stage Scheme 

In this subsection, we introduce AMSVRG, as described in Figure 2.


Algorithm 2$(w_0, (m_s)_{s \in \mathbb{Z}_+}, \eta, (\alpha_{k+1})_{k \in \mathbb{Z}_+}, (b_{k+1})_{k \in \mathbb{Z}_+}, (\tau_k)_{k \in \mathbb{Z}_+})$

    for $s \leftarrow 0, 1, \ldots$
        $y_0 \leftarrow w_s$, $z_0 \leftarrow w_s$
        $w_{s+1} \leftarrow$ Algorithm 1$(y_0, z_0, m_s, \eta, (\alpha_{k+1})_{k \in \mathbb{Z}_+}, (b_{k+1})_{k \in \mathbb{Z}_+}, (\tau_k)_{k \in \mathbb{Z}_+})$


Figure 2: Accelerated efficient Mini-batch SVRG

We consider the convergence of AMSVRG under the following boundedness assumption, which has been used in several papers to analyze incremental and stochastic methods.

Assumption 3. (Boundedness) There is a compact subset $\Omega \subset \mathbb{R}^d$ such that the sequence $\{w_s\}$ generated by AMSVRG is contained in $\Omega$.

Note that, if we change the initialization $z_0 \leftarrow w_s$ to $z_0 \leftarrow z$, where $z$ is a constant, the modified method achieves the same convergence for general convex problems without the boundedness assumption (cf. the Supplementary Materials). However, for the strongly convex case, this modified version is slower than the above scheme. Therefore, we consider the version described in Figure 2.


From Theorem 1, we can see that for small $p$ and $q$ (e.g., $p = 1/10$, $q = 1/4$), the expected value of the objective function is halved at every stage under the assumptions of Theorem 1. Hence, running AMSVRG for $O(\log(1/\epsilon))$ outer iterations achieves an $\epsilon$-accurate solution in expectation. Here, we consider the complexity at stage $s$ to halve the expected objective value. Let $b_{k+1}, m_s \in \mathbb{Z}_+$ be the minimum values satisfying the assumptions of Theorem 1, i.e.,

$$b_{k+1} = \left\lceil \frac{n(k + 2)}{p(n - 1) + k + 2} \right\rceil \quad \text{and} \quad m_s = \left\lceil 4 \sqrt{\frac{L V_{w_s}(x_*)}{q (f(w_s) - f(x_*))}} \right\rceil.$$

If the initial objective gap $f(w_s) - f(x_*)$ in stage $s$ is larger than $\epsilon$, then the complexity at stage $s$ is

$$\begin{aligned}
O\left( n + \sum_{k=0}^{m_s} b_{k+1} \right) &\le O\left( n + \frac{n m_s^2}{n + m_s} \right) \\
&= O\left( n + \frac{nL}{n (f(w_s) - f(x_*)) + \sqrt{(f(w_s) - f(x_*)) L}} \right) \\
&\le O\left( n + \frac{nL}{\epsilon n + \sqrt{\epsilon L}} \right),
\end{aligned}$$

where we used the monotonicity of $b_{k+1}$ with respect to $k$ for the first inequality. Note that by Assumption 3, $\{V_{w_s}(x_*)\}_{s=1,2,\ldots}$ are uniformly bounded, and the notation $O$ also hides $V_{w_s}(x_*)$. The above analysis implies the following theorem.


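Theorem 2. Consider AMSVRG under Assumptions 1 and 3. We set $\eta$, $\alpha_{k+1}$, and $\tau_k$ as in (7). Let $b_{k+1} = \left\lceil \frac{n(k+2)}{p(n-1)+k+2} \right\rceil$ and $m_s = \left\lceil 4 \sqrt{\frac{L V_{w_s}(x_*)}{q (f(w_s) - f(x_*))}} \right\rceil$, where $p$ and $q$ are small values described above. Then, the overall complexity to run AMSVRG for $O(\log(1/\epsilon))$ outer iterations, or to obtain an $\epsilon$-accurate solution, is

$$\tilde{O}\left( n + \frac{nL}{\epsilon n + \sqrt{\epsilon L}} \right).$$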



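Next, we consider the strongly convex case. We assume that $f$ is a $\mu$-strongly convex function. In this case, we choose the distance generating function $d(x) = \frac{1}{2} \|x\|^2$, so that the Bregman divergence becomes $V_x(y) = \frac{1}{2} \|x - y\|^2$. Let the parameters be the same as in Theorem 2. Then, the expected value of the objective function is halved at every stage. Because strong convexity gives $V_{w_s}(x_*) \le \frac{1}{\mu} (f(w_s) - f(x_*))$, we have $m_s \le \left\lceil 4 \sqrt{\kappa / q} \right\rceil$, where $\kappa$ is the condition number $L/\mu$, and the complexity at each stage is

$$O\left( n + \sum_{k=0}^{m_s} b_{k+1} \right) \le O\left( n + \frac{n m_s^2}{n + m_s} \right) \le O\left( n + \frac{n\kappa}{n + \sqrt{\kappa}} \right).$$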


Thus, we have the following theorem. 


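Theorem 3. Consider AMSVRG under Assumptions 1 and 2. Let parameters $\eta$, $\alpha_{k+1}$, $\tau_k$, $m_s$, and $b_{k+1}$ be the same as those in Theorem 2. Then the overall complexity for obtaining an $\epsilon$-accurate solution in expectation is

$$\tilde{O}\left( n + \frac{n\kappa}{n + \sqrt{\kappa}} \right).$$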




This complexity is the same as that of Acc-Prox-SVRG. Note that for the strongly convex case, we 
do not need the boundedness assumption. 


Table 1 lists the overall complexities of AGD, SAG, SVRG, SAGA, Acc-Prox-SVRG, and AMSVRG. The notation $\tilde{O}$ hides constant and logarithmic terms. By simple calculations, we see that

$$\frac{n\kappa}{n + \sqrt{\kappa}} = \frac{1}{2} H(\kappa, n\sqrt{\kappa}), \qquad \frac{nL}{\epsilon n + \sqrt{\epsilon L}} = \frac{1}{2} H\left( \frac{L}{\epsilon}, n\sqrt{\frac{L}{\epsilon}} \right),$$

where $H(\cdot, \cdot)$ is the harmonic mean, whose order is the same as $\min\{\cdot, \cdot\}$. Thus, as shown in Table 1, the complexity of AMSVRG is less than or equal to that of the other methods in any situation. In particular, for non-strongly convex problems, our method potentially outperforms the others.
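The harmonic-mean identities and the claim that $H$ has the same order as $\min$ can be checked numerically. The sketch below uses hypothetical function names and arbitrary example values; it relies only on $H(a, b) = 2ab/(a+b)$, which satisfies $\min\{a,b\} \le H(a,b) \le 2\min\{a,b\}$ for positive $a, b$.

```python
import math

def harmonic_mean(a, b):
    """H(a, b) = 2ab / (a + b) for positive a and b."""
    return 2.0 * a * b / (a + b)

def amsvrg_term_general(n, L, eps):
    """Stage complexity term n*L / (eps*n + sqrt(eps*L)) for general convex f."""
    return n * L / (eps * n + math.sqrt(eps * L))

def amsvrg_term_strong(n, kappa):
    """Stage complexity term n*kappa / (n + sqrt(kappa)) for strongly convex f."""
    return n * kappa / (n + math.sqrt(kappa))
```

For example, one can compare `amsvrg_term_general(n, L, eps)` against `0.5 * harmonic_mean(L / eps, n * math.sqrt(L / eps))` for any positive inputs.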
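Table 1: Comparison of overall complexity.

Convexity         Algorithm            Complexity
General convex    AGD                  $\tilde{O}(n\sqrt{L/\epsilon})$
General convex    SAG, SAGA            $\tilde{O}((n + L)/\epsilon)$
General convex    SVRG, Acc-SVRG       n/a
General convex    AMSVRG               $\tilde{O}(n + \min\{L/\epsilon,\, n\sqrt{L/\epsilon}\})$
Strongly convex   AGD                  $\tilde{O}(n\sqrt{\kappa})$
Strongly convex   SAG                  $\tilde{O}(\max\{n, \kappa\})$
Strongly convex   SVRG                 $\tilde{O}(n + \kappa)$
Strongly convex   Acc-SVRG, AMSVRG     $\tilde{O}(n + \min\{\kappa,\, n\sqrt{\kappa}\})$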


5 Restart Scheme 


The parameters of AMSVRG are essentially $\eta$, $m_s$, and $b_{k+1}$ (i.e., $p$) because the appropriate values of both $\alpha_{k+1}$ and $\tau_k$ can be expressed by $\eta = 1/L$ as in (7). It may be difficult to choose an appropriate $m_s$, which is the restart time for Algorithm 1. So, we propose heuristics for determining the restart time.


First, we suppose that the number of components $n$ is sufficiently large such that the complexity of our method becomes $O(n)$. That is, for appropriate $m_s$, $O(n)$ is an upper bound on $\sum_{k=0}^{m_s} b_{k+1}$ (which is the complexity term). Therefore, we estimate the restart time as the minimum index $m \in \mathbb{Z}_+$ that satisfies $\sum_{k=0}^{m} b_{k+1} \ge n$. This estimated value is an upper bound on $m_s$ (in terms of the order). In this paper, we call this restart method R1.


Second, we propose an adaptive restart method using SVRG. In the strongly convex case, we can easily see that if we restart the AGD for general convex problems every $\sqrt{\kappa}$ iterations, then the method achieves a linear convergence similar to that for strongly convex problems. The drawback of this restart method is that the restarting time depends on an unknown parameter $\kappa$, so several papers have proposed effective adaptive restart methods. Moreover, this technique has been shown to perform well for general convex problems as well. Inspired by these studies, we propose an SVRG-based adaptive restart method called R2. That is, if

$$\langle v_{k+1}, y_{k+1} - y_k \rangle > 0,$$

then we return $y_k$ and start the next stage.


Third, we propose the restart method R3, which is a combination of the above two ideas. When $\sum_{k=0}^{m} b_{k+1}$ exceeds $10n$, we restart Algorithm 1, and when

$$\langle v_{k+1}, y_{k+1} - y_k \rangle > 0 \;\wedge\; \sum_{k=0}^{m} b_{k+1} > n,$$

we return $y_k$ and restart Algorithm 1.
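The three restart heuristics amount to simple predicates evaluated inside the inner loop. The following sketch is a minimal illustration; the function names and the argument `cum_batch` (the running sum of batch sizes used so far in the current stage) are hypothetical.

```python
import numpy as np

def restart_r1(cum_batch, n):
    """R1: restart once the gradient evaluations used in this stage reach n."""
    return cum_batch >= n

def restart_r2(v_next, y_next, y):
    """R2: restart (returning y_k) when <v_{k+1}, y_{k+1} - y_k> > 0."""
    return float(np.dot(v_next, y_next - y)) > 0.0

def restart_r3(v_next, y_next, y, cum_batch, n):
    """R3: restart if the budget exceeds 10n, or if the R2 test fires
    after more than n gradient evaluations in this stage."""
    return cum_batch > 10 * n or (restart_r2(v_next, y_next, y) and cum_batch > n)
```

In a stage loop, these would be checked right after computing $y_{k+1}$, with `cum_batch` incremented by $b_{k+1}$ at each iteration.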


6 Numerical Experiments 

In this section, we compare AMSVRG with SVRG and SAGA. We ran L2-regularized multi-class logistic regression on mnist and covtype, and L2-regularized binary-class logistic regression on rcv1. The datasets and their descriptions can be found at the LIBSVM website (http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/). In these


Figure 3: Comparison of algorithms (AMSVRG R1, AMSVRG R2, AMSVRG R3, SVRG, and SAGA) applied to L2-regularized multi-class logistic regression (left: mnist, middle: covtype) and L2-regularized binary-class logistic regression (right: rcv1).


experiments, we vary regularization parameter A in {0, IQ-^, 10-®, lO-^}. We ran AMSVRG 
using some values of rj from [10-^, 5 x 10] and p from [IQ-^, 10], and then we chose the best j] 
and p. 

The results are shown in Figure 3. The horizontal axis is the number of single-component gradient evaluations. Our methods performed well and outperformed the other methods in some cases. For mnist and covtype, AMSVRG R1 and R3 converged quickly, and for rcv1, AMSVRG R2 worked very well. This tendency was more pronounced when the regularization parameter $\lambda$ was small.

Note that the gradient evaluations for the mini-batch can be parallelized [12], so AMSVRG may be further accelerated in a parallel framework such as GPU computing.


7 Conclusion 


We proposed a method that incorporates the accelerated gradient method and SVRG in an increasing mini-batch setting. We showed that our method achieves a fast convergence complexity for both non-strongly and strongly convex problems.
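Supplementary Materials


A Proof of Lemma 1


To prove Lemma 1, the following lemma is required, which is also shown in [11].

Lemma A. Let $\{\xi_i\}_{i=1}^{n}$ be a set of vectors in $\mathbb{R}^d$ and let $\bar{\xi}$ denote the average of $\{\xi_i\}_{i=1}^{n}$. Let $I$ denote a uniform random variable representing a size-$b$ subset of $\{1, 2, \ldots, n\}$. Then, it follows that

$$\mathbb{E}_I \left\| \frac{1}{b} \sum_{i \in I} \xi_i - \bar{\xi} \right\|^2 = \frac{n - b}{b(n - 1)} \, \mathbb{E}_i \|\xi_i - \bar{\xi}\|^2,$$

where $\mathbb{E}_i$ denotes the expectation with respect to a uniformly chosen index $i \in \{1, \ldots, n\}$.

Proof. We denote a size-$b$ subset of $\{1, 2, \ldots, n\}$ by $S = \{i_1, \ldots, i_b\}$ and denote $\xi_i - \bar{\xi}$ by $\tilde{\xi}_i$. Then,

$$\begin{aligned}
\mathbb{E}_I \left\| \frac{1}{b} \sum_{i \in I} \xi_i - \bar{\xi} \right\|^2
&= \frac{1}{C(n, b)} \sum_{S} \left\| \frac{1}{b} \sum_{j=1}^{b} \tilde{\xi}_{i_j} \right\|^2 \\
&= \frac{1}{b^2 C(n, b)} \sum_{S} \left\| \sum_{j=1}^{b} \tilde{\xi}_{i_j} \right\|^2 \\
&= \frac{1}{b^2 C(n, b)} \sum_{S} \left( \sum_{j=1}^{b} \|\tilde{\xi}_{i_j}\|^2 + 2 \sum_{j, k,\, j < k} \langle \tilde{\xi}_{i_j}, \tilde{\xi}_{i_k} \rangle \right),
\end{aligned}$$

where $C(\cdot, \cdot)$ is a combination. By symmetry, each $\tilde{\xi}_i$ appears $\frac{b\, C(n, b)}{n}$ times and each pair $\langle \tilde{\xi}_i, \tilde{\xi}_j \rangle$ ($i < j$) appears $\frac{C(b, 2) C(n, b)}{C(n, 2)}$ times in $\sum_S$. Therefore, we have

$$\begin{aligned}
\mathbb{E}_I \left\| \frac{1}{b} \sum_{i \in I} \xi_i - \bar{\xi} \right\|^2
&= \frac{1}{b^2 C(n, b)} \left( \frac{b\, C(n, b)}{n} \sum_{i=1}^{n} \|\tilde{\xi}_i\|^2 + \frac{2 C(b, 2) C(n, b)}{C(n, 2)} \sum_{i, j,\, i < j} \langle \tilde{\xi}_i, \tilde{\xi}_j \rangle \right) \\
&= \frac{1}{bn} \sum_{i=1}^{n} \|\tilde{\xi}_i\|^2 + \frac{2(b - 1)}{bn(n - 1)} \sum_{i, j,\, i < j} \langle \tilde{\xi}_i, \tilde{\xi}_j \rangle.
\end{aligned}$$

Since $0 = \left\| \sum_{i=1}^{n} \tilde{\xi}_i \right\|^2 = \sum_{i=1}^{n} \|\tilde{\xi}_i\|^2 + 2 \sum_{i, j,\, i < j} \langle \tilde{\xi}_i, \tilde{\xi}_j \rangle$, we have

$$\mathbb{E}_I \left\| \frac{1}{b} \sum_{i \in I} \xi_i - \bar{\xi} \right\|^2 = \left( \frac{1}{bn} - \frac{b - 1}{bn(n - 1)} \right) \sum_{i=1}^{n} \|\tilde{\xi}_i\|^2 = \frac{n - b}{b(n - 1)} \cdot \frac{1}{n} \sum_{i=1}^{n} \|\tilde{\xi}_i\|^2.$$

This finishes the proof of Lemma A. □

We now prove Lemma 1.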
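Lemma A is an exact identity, so it can be verified by brute force on a small set of vectors by enumerating every size-$b$ subset. The sketch below does this; the function names are hypothetical.

```python
import itertools
import numpy as np

def minibatch_variance(xi, b):
    """Exact E_I || (1/b) * sum_{i in I} xi_i - mean ||^2 over all size-b subsets I."""
    n = len(xi)
    mean = xi.mean(axis=0)
    sq = [np.sum((xi[list(S)].mean(axis=0) - mean) ** 2)
          for S in itertools.combinations(range(n), b)]
    return float(np.mean(sq))

def lemma_a_rhs(xi, b):
    """Right-hand side of Lemma A: (n - b) / (b * (n - 1)) * E_i ||xi_i - mean||^2."""
    n = len(xi)
    mean = xi.mean(axis=0)
    return (n - b) / (b * (n - 1)) * float(np.mean(np.sum((xi - mean) ** 2, axis=1)))
```

The factor $(n-b)/(b(n-1))$ equals $1$ at $b = 1$ and $0$ at $b = n$, matching single-sample variance and the full-batch case respectively.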


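Proof of Lemma 1. We set $v_k^j = \nabla f_j(x_k) - \nabla f_j(\tilde{x}) + \nabla f(\tilde{x})$. Using Lemma A and

$$v_k = \frac{1}{b} \sum_{j \in I_k} v_k^j,$$

the conditional variance of $v_k$ is as follows:

$$\mathbb{E}_{I_k} \|v_k - \nabla f(x_k)\|^2 = \frac{1}{b} \cdot \frac{n - b}{n - 1} \, \mathbb{E}_j \|v_k^j - \nabla f(x_k)\|^2,$$

where the expectation on the right-hand side is taken with respect to $j \in \{1, \ldots, n\}$. By Corollary 3 in [8], it follows that

$$\mathbb{E}_j \|v_k^j - \nabla f(x_k)\|^2 \le 4L \left( f(x_k) - f(x_*) + f(\tilde{x}) - f(x_*) \right).$$

This completes the proof of Lemma 1.

□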
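B Stochastic gradient descent analysis

Below is the proof of Lemma 3.

Proof of Lemma 3. It is clear that $y_k$ is equal to $x_k - \eta v_k$. Since $f(x)$ is $L$-smooth and $\eta = \frac{1}{L}$, we have

$$\begin{aligned}
f(y_k) &\le f(x_k) + \langle \nabla f(x_k), y_k - x_k \rangle + \frac{L}{2} \|y_k - x_k\|^2 \\
&= f(x_k) - \frac{1}{L} \langle \nabla f(x_k), v_k \rangle + \frac{1}{2L} \|v_k\|^2.
\end{aligned}$$

$v_k$ is an unbiased estimator of the gradient $\nabla f(x_k)$, that is, $\mathbb{E}_{I_k}[v_k] = \nabla f(x_k)$. Hence, we have

$$\mathbb{E}_{I_k} \|v_k\|^2 = \|\nabla f(x_k)\|^2 + \mathbb{E}_{I_k} \|v_k - \nabla f(x_k)\|^2.$$

Using the above two expressions, we get

$$\begin{aligned}
\mathbb{E}_{I_k}[f(y_k)] &\le f(x_k) - \frac{1}{L} \|\nabla f(x_k)\|^2 + \frac{1}{2L} \mathbb{E}_{I_k} \|v_k\|^2 \\
&= f(x_k) - \frac{1}{2L} \|\nabla f(x_k)\|^2 + \frac{1}{2L} \mathbb{E}_{I_k} \|v_k - \nabla f(x_k)\|^2.
\end{aligned}$$

□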


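C Stochastic mirror descent analysis

We give the proof of Lemma 4.


Proof of Lemma 4. The following are basic properties of the Bregman divergence:

$$\langle \nabla V_x(y), u - y \rangle = V_x(u) - V_y(u) - V_x(y), \tag{8}$$

$$V_x(y) \ge \frac{1}{2} \|x - y\|^2. \tag{9}$$

Using (8) and (9), we have

$$\begin{aligned}
\alpha_k \langle v_k, z_{k-1} - u \rangle
&= \alpha_k \langle v_k, z_{k-1} - z_k \rangle + \alpha_k \langle v_k, z_k - u \rangle \\
&= \alpha_k \langle v_k, z_{k-1} - z_k \rangle - \langle \nabla V_{z_{k-1}}(z_k), z_k - u \rangle \\
&\overset{(8)}{=} \alpha_k \langle v_k, z_{k-1} - z_k \rangle + V_{z_{k-1}}(u) - V_{z_k}(u) - V_{z_{k-1}}(z_k) \\
&\overset{(9)}{\le} \alpha_k \langle v_k, z_{k-1} - z_k \rangle - \frac{1}{2} \|z_{k-1} - z_k\|^2 + V_{z_{k-1}}(u) - V_{z_k}(u) \\
&\le \frac{1}{2} \alpha_k^2 \|v_k\|^2 + V_{z_{k-1}}(u) - V_{z_k}(u),
\end{aligned}$$

where for the second equality we use the stochastic mirror descent step, that is, $\alpha_k v_k + \nabla V_{z_{k-1}}(z_k) = 0$, and for the last inequality we use the Fenchel-Young inequality $\alpha_k \langle v_k, z_{k-1} - z_k \rangle \le \frac{1}{2} \alpha_k^2 \|v_k\|^2 + \frac{1}{2} \|z_{k-1} - z_k\|^2$.

By taking the expectation with respect to $I_k$ and using $\mathbb{E}_{I_k} \|v_k\|^2 = \|\nabla f(x_k)\|^2 + \mathbb{E}_{I_k} \|v_k - \nabla f(x_k)\|^2$, we have

$$\alpha_k \langle \nabla f(x_k), z_{k-1} - u \rangle \le V_{z_{k-1}}(u) - \mathbb{E}_{I_k}[V_{z_k}(u)] + \frac{1}{2} \alpha_k^2 \|\nabla f(x_k)\|^2 + \frac{1}{2} \alpha_k^2 \mathbb{E}_{I_k} \|v_k - \nabla f(x_k)\|^2.$$

This finishes the proof of Lemma 4. □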
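The three-point identity (8) holds for any differentiable distance generating function, not only the Euclidean one. The sketch below checks it numerically for the negative entropy $d(x) = \sum_i x_i \log x_i$ on positive vectors; this choice of $d$ and all function names are assumptions made for illustration and do not appear in the paper.

```python
import numpy as np

def neg_entropy(x):
    """Illustrative distance generating function d(x) = sum_i x_i * log(x_i)."""
    return float(np.sum(x * np.log(x)))

def neg_entropy_grad(x):
    """Gradient of the negative entropy: log(x) + 1."""
    return np.log(x) + 1.0

def bregman(x, y, d=neg_entropy, grad_d=neg_entropy_grad):
    """V_x(y) = d(y) - d(x) - <grad d(x), y - x>."""
    return d(y) - d(x) - float(np.dot(grad_d(x), y - x))

def bregman_grad_in_y(x, y, grad_d=neg_entropy_grad):
    """Gradient of y -> V_x(y), which equals grad d(y) - grad d(x)."""
    return grad_d(y) - grad_d(x)
```

Swapping in $d(x) = \frac{1}{2}\|x\|^2$ recovers $V_x(y) = \frac{1}{2}\|x - y\|^2$, the case used in the rest of the paper.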
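Algorithm 3$(w_0, (m_s)_{s \in \mathbb{Z}_+}, \eta, (\alpha_{k+1})_{k \in \mathbb{Z}_+}, (b_{k+1})_{k \in \mathbb{Z}_+}, (\tau_k)_{k \in \mathbb{Z}_+})$

    for $s \leftarrow 0, 1, \ldots$
        $y_0 \leftarrow w_s$, $z_0 \leftarrow w_0$
        $w_{s+1} \leftarrow$ Algorithm 1$(y_0, z_0, m_s, \eta, (\alpha_{k+1})_{k \in \mathbb{Z}_+}, (b_{k+1})_{k \in \mathbb{Z}_+}, (\tau_k)_{k \in \mathbb{Z}_+})$


Figure 4: Modified AMSVRG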


D Modified AMSVRG for general convex problems 


We now introduce a modified AMSVRG (described in Figure 4) that does not need the boundedness assumption for general convex problems. We set $\eta$, $\alpha_{k+1}$, and $\tau_k$ as in (7). Let $b_{k+1} \in \mathbb{Z}_+$ be the minimum values satisfying $4L \delta_{k+1} \alpha_{k+1} \le p$ for small $p$ (e.g., $1/4$). Let $m_s = \left\lceil 4 \sqrt{\frac{L V_{w_0}(x_*)}{\epsilon}} \right\rceil$. From Theorem 1, we get

$$\mathbb{E}[f(w_{s+1}) - f(x_*)] \le \epsilon + a (f(w_s) - f(x_*)),$$

where $a = \frac{5}{2} p$. Thus, it follows that

$$\begin{aligned}
\mathbb{E}[f(w_{s+1}) - f(x_*)] &\le \sum_{t=0}^{s} a^t \epsilon + a^{s+1} (f(w_0) - f(x_*)) \\
&\le \frac{1}{1 - a} \epsilon + a^{s+1} (f(w_0) - f(x_*)).
\end{aligned}$$

Hence, running the modified AMSVRG for $O(\log \frac{1}{\epsilon})$ outer iterations achieves an $\epsilon$-accurate solution in expectation, and the complexity at each stage is

$$O\left( n + \sum_{k=0}^{m_s} b_{k+1} \right) \le O\left( n + \frac{n m_s^2}{n + m_s} \right) = O\left( n + \frac{nL}{\epsilon n + \sqrt{\epsilon L}} \right),$$

where we used the monotonicity of $b_{k+1}$ with respect to $k$ for the first inequality. Note that $V_{z_0}(x_*)$ is constant (i.e., $V_{w_0}(x_*)$), and $O$ hides this term. From the above analysis, we derive the following theorem.

Theorem 4. Consider the modified AMSVRG under Assumption 1. Let parameters be as above. Then the overall complexity for obtaining an $\epsilon$-accurate solution in expectation is